Learning N-Gram Language Models from Uncertain Data

Authors

  • Vitaly Kuznetsov
  • Hank Liao
  • Mehryar Mohri
  • Michael Riley
  • Brian Roark
Abstract

We present a new algorithm for efficiently training n-gram language models on uncertain data, and illustrate its use for semi-supervised language model adaptation. We compute the probability that an n-gram occurs k times in the sample of uncertain data, and use the resulting histograms to derive a generalized Katz back-off model. We compare three approaches to semi-supervised adaptation of language models for speech recognition of selected YouTube video categories: (1) using just the one-best output from the baseline speech recognizer, (2) using samples from lattices with standard algorithms, or (3) using full lattices with our new algorithm. Unlike the other methods, our new algorithm provides models that yield solid improvements over the baseline on the full test set and, further, achieves these gains without hurting performance on any of the video categories. We show that the categories with the most data yielded the largest gains. The algorithm has been released as part of the OpenGrm n-gram library [1].
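
To make the counting step concrete, here is a toy sketch of the idea. This is an editorial illustration, not the OpenGrm implementation: each uncertain utterance is represented as an explicit list of weighted hypotheses rather than a lattice, and names such as corpus_histogram are hypothetical. It convolves per-utterance count distributions to obtain the probability that a given n-gram occurs exactly k times in the corpus:

    from collections import defaultdict

    def ngrams(tokens, n):
        # All n-grams (as tuples) in one token sequence.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def count_distribution(hypotheses, target, n):
        # P(target occurs j times) within one uncertain utterance, given
        # (token_list, posterior) hypothesis pairs whose weights sum to 1.
        dist = defaultdict(float)
        for tokens, posterior in hypotheses:
            dist[ngrams(tokens, n).count(target)] += posterior
        return dist

    def corpus_histogram(corpus, target, n):
        # P(target occurs exactly k times in the whole corpus), assuming
        # utterances are independent: convolve per-utterance distributions.
        hist = {0: 1.0}
        for hypotheses in corpus:
            utt_dist = count_distribution(hypotheses, target, n)
            new_hist = defaultdict(float)
            for k, p in hist.items():
                for j, q in utt_dist.items():
                    new_hist[k + j] += p * q
            hist = new_hist
        return dict(hist)

    # Two uncertain utterances, each with two weighted hypotheses.
    corpus = [
        [("the cat sat".split(), 0.7), ("a cat sat".split(), 0.3)],
        [("the cat ran".split(), 0.6), ("the bat ran".split(), 0.4)],
    ]
    print(corpus_histogram(corpus, ("the", "cat"), 2))
    # -> {0: 0.12, 1: 0.46, 2: 0.42}, up to floating-point rounding

Such histograms supply the fractional count-of-count statistics that Katz/Good-Turing-style discounting needs; the paper's contribution is computing them efficiently over full lattices rather than by enumeration or sampling.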


Related Articles

Learning from Uncertain Data

The application of statistical methods to natural language processing has been remarkably successful over the past two decades. But to deal with recent problems arising in this field, machine learning techniques must be generalized to handle uncertain data, that is, datasets whose elements are distributions over sequences, such as weighted automata. This paper reviews a number of recent results r...


Learning Representations for Weakly Supervised Natural Language Processing Tasks

Finding the right representations for words is critical for building accurate NLP systems when domain-specific labeled data for the task is scarce. This article investigates novel techniques for extracting features from n-gram models, Hidden Markov Models, and other statistical language models, including a novel Partial Lattice Markov Random Field model. Experiments on part-of-speech tagging and...


Language Modeling for Limited-Data Domains

With the increasing focus of speech recognition and natural language processing applications on domains with limited amounts of in-domain training data, enhanced system performance often relies on approaches involving model adaptation and combination. In such domains, language models are often constructed by interpolating component models trained from partially matched corpora. Instead of simple...
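
For context, the simple baseline such work starts from is linear interpolation of component models. The one-liner below is an editorial sketch with hypothetical names, not code from the paper; p_in and p_out are assumed to be conditional probability functions:

    def interpolate(p_in, p_out, lam):
        # P(w | h) = lam * P_in(w | h) + (1 - lam) * P_out(w | h)
        return lambda w, h: lam * p_in(w, h) + (1.0 - lam) * p_out(w, h)

    # lam is typically tuned to minimize perplexity on held-out in-domain text.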


Program Synthesis for Character Level Language Modeling

We propose a statistical model applicable to character-level language modeling and show that it is a good fit for both program source code and English text. The model is parameterized by a program from a domain-specific language (DSL) that allows expressing non-trivial data dependencies. Learning is done in two phases: (i) we synthesize a program from the DSL, essentially learning a good repre...


Mining of Association Patterns for Language Modeling

Language modeling using n-grams is popular for speech recognition and many other applications. The conventional n-gram model suffers from insufficient training data and domain knowledge, and from its inability to capture long-distance language dependencies. This paper presents a new approach to mining long-distance word associations and incorporating their mutual information into language models. We aim to discover the associ...
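
The truncated abstract does not specify the association score used; a common choice, assumed here purely for illustration, is pointwise mutual information estimated from co-occurrence counts:

    import math

    def pmi(count_xy, count_x, count_y, total):
        # PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), estimated from counts.
        return math.log2((count_xy / total) /
                         ((count_x / total) * (count_y / total)))

    print(pmi(count_xy=100, count_x=1000, count_y=2000, total=1_000_000))
    # -> log2(50) ≈ 5.64: the pair co-occurs 50x more often than chance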




Publication date: 2016